Abstract

This paper introduces an open-ended sequential algorithm for computing the p-value of a 
test using Monte Carlo simulation. It guarantees that the resampling risk, the probability of 
a different decision than the one based on the theoretical p-value, is uniformly bounded by 
an arbitrarily small constant. Previously suggested sequential or non-sequential algorithms, 
using a bounded sample size, do not have this property. Although the algorithm is open- 
ended, the expected number of steps is finite, except when the p-value is on the threshold 
between rejecting and not rejecting. The algorithm is suitable as a standard for implementing
tests that require (re-)sampling. It can also be used in other situations: to check whether 
a test is conservative, iteratively to implement double bootstrap tests, and to determine the 
sample size required for a certain power. 

1 Introduction

Consider a statistical test that rejects the null hypothesis for large values of a test statistic T. Having
observed a realization t, one usually wants to compute the p-value

$$p = P(T \geq t),$$

where, ideally, P is the true probability measure under the null hypothesis. Of course, when the null
hypothesis is composite, P is often estimated (parametrically or non-parametrically).

In many cases, e.g. for bootstrap tests, the p-value p cannot be evaluated explicitly. The usual remedy
is a Monte Carlo test that essentially replaces p by

$$\hat{p}_{naive} = \frac{1}{n} \sum_{i=1}^{n} 1\{T_i \geq t\},$$

where $T_1, \ldots, T_n$ are independent replicates of the test statistic T under P and $1\{\cdot\}$ denotes the indicator
function.
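
The naive estimator can be sketched in a few lines of Python. This is only an illustration: the names
`naive_mc_pvalue` and `sample_statistic` are hypothetical, and a standard normal test statistic is assumed
purely to have something to simulate.

```python
import random

def naive_mc_pvalue(t_obs, sample_statistic, n, rng):
    """Fraction of n simulated replicates T_i with T_i >= t_obs."""
    exceed = sum(1 for _ in range(n) if sample_statistic(rng) >= t_obs)
    return exceed / n

# Hypothetical example: T is standard normal under the null hypothesis.
rng = random.Random(1)
p_naive = naive_mc_pvalue(1.5, lambda r: r.gauss(0.0, 1.0), 1000, rng)
```

Two users running this with different seeds will in general obtain different values of `p_naive`, which
is the source of the resampling risk discussed next.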

A reasonable requirement for a statistical method is what Gleser (1996) called the first law of applied
statistics: "Two individuals using the same statistical method on the same data should arrive at the same
conclusion." For a test, "conclusion" is whether it rejects or not, i.e. whether p is above or below a given
threshold $\alpha$ (often $\alpha = 0.05$).

Monte Carlo tests do not satisfy this law. For an estimator $\hat{p}$, let the resampling risk $RR_p(\hat{p})$ be the
probability that $\hat{p}$ and the true p are on different sides of the threshold $\alpha$. More precisely,

$$RR_p(\hat{p}) = \begin{cases} P_p(\hat{p} > \alpha) & \text{if } p \leq \alpha, \\ P_p(\hat{p} \leq \alpha) & \text{if } p > \alpha. \end{cases}$$

The resampling risk $RR_p(\hat{p}_{naive})$ of the naive estimator $\hat{p}_{naive}$ can be substantial, e.g.
$RR_p(\hat{p}_{naive}) = 0.146$ for n = 999, $\alpha = 0.1$, and p = 0.11 (Jockel, 1984, Table 1). Furthermore,
no matter how large n is chosen,

$$\sup_{p \in [0,1]} RR_p(\hat{p}_{naive}) \geq 0.5.$$
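
The size of this risk can be explored numerically. The following sketch assumes that the naive estimator
is exactly $S/n$ with $S$ binomially distributed; the function names are hypothetical, and the convention
used in the cited table may differ slightly, so the output need not match the quoted value to the last digit.

```python
import math

def binom_cdf(k, n, p):
    """P(S <= k) for S ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    if k < 0:
        return 0.0
    total = 0.0
    for j in range(min(k, n) + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                   + j * math.log(p) + (n - j) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def naive_resampling_risk(p, n, alpha):
    """Resampling risk of the estimator S/n, assuming S ~ Binomial(n, p)."""
    k = math.floor(n * alpha + 1e-9)   # largest count s with s/n <= alpha
    below = binom_cdf(k, n, p)         # P(p_hat <= alpha)
    return 1.0 - below if p <= alpha else below

rr = naive_resampling_risk(0.11, 999, 0.1)        # setting of the example above
rr_near = naive_resampling_risk(0.1001, 999, 0.1)  # p just above the threshold
```

Moving p towards the threshold, as in `rr_near`, pushes the risk towards 1/2 regardless of n.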



In the present article, we introduce a recursively defined sequential algorithm which gives an estimator
$\hat{p}$ of p that uniformly bounds the resampling risk as follows:

$$\sup_{p \in [0,1]} RR_p(\hat{p}) \leq \epsilon$$

for some arbitrary (small) $\epsilon > 0$. Although the algorithm is open-ended, i.e. the number of steps is not
bounded, the expected number of steps is finite for $p \neq \alpha$. In particular, if p is far away from the threshold
$\alpha$ then the algorithm usually stops quickly.

Having reached step n without stopping, there exists an interval (with length going to 0 as $n \to \infty$)
that contains the not yet available estimate $\hat{p}$. This interval can be used as an interim result.

It is well known that Monte Carlo tests may lose power compared to the theoretical test, see e.g.
Hope (1968) or Davison and Hinkley (1997, p. 155, 156). Our algorithm bounds this loss of power by the
arbitrarily small constant $\epsilon$.

The proposed sequential algorithm can be used as a standard implementation for (re-)sampling based
tests in statistical software. Essentially, one only has to set $\epsilon$ to a suitably small default value (e.g. $10^{-3}$
or $10^{-5}$), and ensure that the algorithm reports intermediate results until it finishes. An R package is
available from the author's web page

http://www.ma.ic.ac.uk/~agandy

Other sequential procedures to compute p-values have been suggested previously. The suggestion of
Davidson and MacKinnon (2000) is relatively close to our algorithm. They also use a uniform bound
on the resampling risk as motivation. However, their algorithm does not guarantee this bound
since, when deciding whether to stop, they do not take into account the problem of multiple testing.
Furthermore, whereas we allow stopping after each step, they only allow stopping after $2^k B$ steps for
$k = 0, \ldots, n$, where B is some constant.

Besag and Clifford (1991) suggest a sequential procedure which stops if the partial sum
$\sum_{i=1}^{n} 1\{T_i \geq t\}$ reaches a given threshold or if a given number of samples n is reached. The
motivation is that if p is high, i.e. if the test result is far away from being significant, fewer replicates
are needed than in the naive approach. More recently, Fay et al. (2007) suggested using a truncated
sequential probability ratio test.

Andrews and Buchinsky (2000, 2001) suggest using as criterion the relative difference between the
p-value from the finite-sample bootstrap and the 'ideal' bootstrap using infinite sample size. This method
involves drawing some fixed number of bootstrap samples and then using asymptotic arguments to
determine the number of repetitions needed. Once the number of repetitions has been chosen, the test can be
performed by drawing the remaining bootstrap repetitions (without further sequential considerations).

Figure 1: The algorithm stops if $S_n \geq U_n$ or $S_n \leq L_n$. The boundaries are computed for the threshold
$\alpha = 0.05$ using the spending sequence $\epsilon_n = \frac{n}{1000(1000+n)}$.

Bayesian approaches have been suggested in Lai (1988) and in Fay and Follmann (2002). By putting
a (prior) distribution on p, the average resampling risk $E(RR_p(\hat{p}))$ can be bounded. This bound is much
weaker than our uniform bound on $RR_p$.

The present article is structured as follows: In Section 2 we precisely define the sequential algorithm
and describe the key results of the paper. In Section 3 we comment on several aspects of our algorithm
such as the expected number of steps, choice of tuning parameters, and details of the implementation.
In Section 4 we demonstrate the wide applicability of our sequential algorithm in a simple practical
example. Proofs are relegated to the appendix.



2 The Algorithm and Key Results 

Instead of considering independent replicates of $1\{T \geq t\}$ one can obviously consider replicates from a
Bernoulli distribution with parameter p. From now on, let $X_1, X_2, \ldots$ be independent and identically
distributed Bernoulli random variables with parameter p. In the notation of the introduction,
$X_i = 1\{T_i \geq t\}$. Expectations and probabilities are taken using the p that is indicated in a subscript
($P_p$ and $E_p$).

Our sequential algorithm stops once the partial sum $S_n = \sum_{i=1}^{n} X_i$ hits boundaries given by two
integer sequences $(U_n)_{n \in \mathbb{N}}$ and $(L_n)_{n \in \mathbb{N}}$ with $U_n > L_n$, i.e. we stop after

$$\tau = \inf\{k \in \mathbb{N} : S_k \geq U_k \text{ or } S_k \leq L_k\}$$

steps. In the above, $\mathbb{N} = \{1, 2, \ldots\}$. Figure 1 shows an example of sequences $U_n$ and $L_n$ resulting from
the following definition.

We construct $U_n$ and $L_n$ such that for $p \leq \alpha$ (resp. $p > \alpha$) the probability of hitting the upper
boundary $U_n$ (resp. lower boundary $L_n$) is at most $\epsilon$, where $\epsilon > 0$ is the desired bound on the resampling
risk. We will see that it suffices to ensure that for $p = \alpha$ the probability of hitting each boundary is at
most $\epsilon$. We use a recursive definition that for each n minimizes $U_n$ (resp. maximizes $L_n$) conditional on

$$P_\alpha(\text{hit upper boundary until } n) = P_\alpha(\tau \leq n, S_\tau \geq U_\tau) \leq \epsilon_n \quad \text{and} \qquad (1)$$
$$P_\alpha(\text{hit lower boundary until } n) = P_\alpha(\tau \leq n, S_\tau \leq L_\tau) \leq \epsilon_n, \qquad (2)$$

where $\epsilon_n$ is a non-decreasing sequence with $\epsilon_n \to \epsilon$ and $0 \leq \epsilon_n < \epsilon$. We call $\epsilon_n$ the spending
sequence. The sequence $\epsilon_n$ is used to control how fast the allowed resampling risk $\epsilon$ is spent. In the
examples of this paper, we will use $\epsilon_n = \epsilon \frac{n}{n+k}$ for some constant k. There is a close connection of our
spending sequence $\epsilon_n$ to the $\alpha$-spending function (or "use" function) of Lan and DeMets (1983).

The formal definition of the boundaries $(U_n)$ and $(L_n)$ is as follows: Let $U_1 = 2$, $L_1 = -1$ and,
recursively for $n \in \{2, 3, \ldots\}$, let

$$U_n = \min\{j \in \mathbb{N} : P_\alpha(\tau \geq n, S_n \geq j) + P_\alpha(\tau < n, S_\tau \geq U_\tau) \leq \epsilon_n\},$$
$$L_n = \max\{j \in \mathbb{Z} : P_\alpha(\tau \geq n, S_n \leq j) + P_\alpha(\tau < n, S_\tau \leq L_\tau) \leq \epsilon_n\}. \qquad (3)$$

Note that $U_n$ (resp. $L_n$) is the minimal (resp. maximal) value for which (1) (resp. (2)) holds true given
$U_1, \ldots, U_{n-1}$ and $L_1, \ldots, L_{n-1}$. Using induction, one can see that (1) and (2) hold true for all n.
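
The recursion can be made concrete with a short Python sketch. It is an illustration under stated
assumptions, not the implementation of the R package: the function name `compute_boundaries` is
hypothetical, the spending sequence $\epsilon_n = \epsilon n/(n+k)$ is assumed, probabilities are held in
floating point, and $\epsilon$ is taken to be small enough that some probability mass always survives.

```python
def compute_boundaries(alpha, eps, k, n_max):
    """Return (U, L, hit_up, hit_low) for steps 1..n_max.

    Assumes the spending sequence eps_n = eps * n / (n + k).
    """
    U, L = {1: 2}, {1: -1}
    # q[s] = P_alpha(tau > n, S_n = s): mass that has not stopped yet.
    q = {0: 1.0 - alpha, 1: alpha}
    hit_up = hit_low = 0.0  # P_alpha(tau <= n, upper / lower boundary hit)
    for n in range(2, n_max + 1):
        # r[s] = P_alpha(tau >= n, S_n = s): one Bernoulli step from q.
        r = {}
        for s, mass in q.items():
            r[s] = r.get(s, 0.0) + mass * (1.0 - alpha)
            r[s + 1] = r.get(s + 1, 0.0) + mass * alpha
        eps_n = eps * n / (n + k)
        # Upper boundary: lower u while the spent error stays within eps_n.
        u, tail = max(r) + 1, 0.0
        while u - 1 >= 1 and hit_up + tail + r.get(u - 1, 0.0) <= eps_n:
            u -= 1
            tail += r.get(u, 0.0)
        # Lower boundary: raise l while the spent error stays within eps_n.
        # The guard l + 1 < u is a safety stop for degenerate inputs.
        l, head = min(r) - 1, 0.0
        while l + 1 < u and hit_low + head + r.get(l + 1, 0.0) <= eps_n:
            l += 1
            head += r.get(l, 0.0)
        hit_up += tail
        hit_low += head
        U[n], L[n] = u, l
        # Keep only the mass strictly between the boundaries.
        q = {s: mass for s, mass in r.items() if l < s < u}
    return U, L, hit_up, hit_low

U, L, hit_up, hit_low = compute_boundaries(alpha=0.05, eps=1e-3, k=1000, n_max=500)
```

Only the surviving probability mass `q` and the two accumulated hitting probabilities are carried from
step to step, so memory grows with the distance between the boundaries rather than with n.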

The following theorem shows that the expected number of steps of the algorithm is finite for $p \neq \alpha$
and that the probability of hitting the "wrong" boundary is bounded by $\epsilon$.

Theorem 1. Suppose that $\epsilon < 1/2$ and $\log(\epsilon_n - \epsilon_{n-1}) = o(n)$ as $n \to \infty$. Then $U_n - n\alpha = o(n)$,
$n\alpha - L_n = o(n)$ and $E_p(\tau) < \infty$ for all $p \neq \alpha$. Furthermore,

$$\sup_{p \in [0,\alpha]} P_p(\tau < \infty, S_\tau \geq U_\tau) \leq \epsilon \quad \text{and} \quad \sup_{p \in (\alpha,1]} P_p(\tau < \infty, S_\tau \leq L_\tau) \leq \epsilon. \qquad (4)$$

The proof of this and the next theorem can be found in the appendix.
As estimator for p we use the maximum likelihood estimator

$$\hat{p} = \begin{cases} S_\tau/\tau, & \tau < \infty, \\ \alpha, & \tau = \infty. \end{cases}$$

Figure 2 shows how the estimator depends on when the boundary is hit.
The next theorem gives the uniform bound on the resampling risk.

Theorem 2. Suppose $\epsilon < 1/4$ and $\log(\epsilon_n - \epsilon_{n-1}) = o(n)$ as $n \to \infty$. Then

$$\sup_{p \in [0,1]} RR_p(\hat{p}) \leq \epsilon.$$

Note that our default spending sequence $\epsilon_n = \epsilon \frac{n}{n+k}$ satisfies the conditions of the above theorems.
The conditions in the above two theorems are not minimal. For example, considering whether to stop
only every $\nu \in \mathbb{N}$ steps will, using a slightly modified proof, lead to the same results.
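
That the default spending sequence meets the growth condition can be checked directly. The following
short verification, for $\epsilon_n = \epsilon n/(n+k)$ with a constant $k > 0$, is added as a worked example and is
not part of the proofs in the appendix:

$$\epsilon_n - \epsilon_{n-1} = \epsilon\left(\frac{n}{n+k} - \frac{n-1}{n-1+k}\right) = \frac{\epsilon k}{(n+k)(n+k-1)},$$

$$\log(\epsilon_n - \epsilon_{n-1}) = \log(\epsilon k) - \log(n+k) - \log(n+k-1) = O(\log n) = o(n).$$

The sequence is also increasing with $0 < \epsilon_n < \epsilon$ and $\epsilon_n \to \epsilon$, as required of a spending sequence.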

Figure 2: If the upper boundary is hit after $\tau$ steps then $\hat{p} = U_\tau/\tau$, if the lower boundary is hit then
$\hat{p} = L_\tau/\tau$. We used $\alpha = 0.05$ and $\epsilon_n = \frac{n}{1000(1000+n)}$. Note the log-scale on the horizontal axis.



3 Remarks 

3.1 Lower Bound on the Expected Number of Steps 

Suppose $\hat{p}$ is a sequential algorithm with stopping time $\tau$ that has a uniformly bounded resampling risk

$$\sup_{p \in [0,1]} RR_p(\hat{p}) \leq \epsilon.$$

We derive a lower bound on $E_p(\tau)$ and show that in a Bayesian setup the expected number of steps is
infinite.

For $p_0 > \alpha$ consider the hypotheses $H_0: p = \alpha$ against $H_1: p = p_0$. We can construct a test by
rejecting $H_0$ iff $\hat{p} > \alpha$. The probability of both the type I and the type II error is at most $\epsilon$.

Consider the sequential probability ratio test (SPRT), see Wald (1945), of $H_0$ against $H_1$ with the
same error probabilities. Let $\sigma_{p_0}$ denote its stopping time. The SPRT minimizes the expected number of
steps among all sequential tests with the same error probabilities (Wald and Wolfowitz, 1948). Thus,

$$E_{p_0}(\tau) \geq E_{p_0}(\sigma_{p_0}) \approx \frac{(1-\epsilon)\log\left(\frac{1-\epsilon}{\epsilon}\right) + \epsilon\log\left(\frac{\epsilon}{1-\epsilon}\right)}{p_0\log\left(\frac{p_0}{\alpha}\right) + (1-p_0)\log\left(\frac{1-p_0}{1-\alpha}\right)}, \qquad (5)$$

where the approximation is from Wald (1945, (4.8)). The approximation can be replaced by an inequality
via Wald (1945, (4.13), (4.15) or (4.16)).

Equation (5) also holds true for $p_0 < \alpha$. Indeed, one only needs to replace the above $H_0$ by $H_0: p =
\alpha + \delta$ for some $\delta > 0$ and let $\delta \to 0$.
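
The right-hand side of (5) is easy to evaluate. The following sketch (the function name
`sprt_lower_bound` is hypothetical) computes Wald's approximation for a few values of $p_0$; it requires
$0 < p_0 < 1$ and $p_0 \neq \alpha$.

```python
import math

def sprt_lower_bound(p0, alpha, eps):
    """Wald's approximation (5) to the expected number of SPRT steps under p = p0."""
    numerator = ((1 - eps) * math.log((1 - eps) / eps)
                 + eps * math.log(eps / (1 - eps)))
    # Kullback-Leibler divergence between Bernoulli(p0) and Bernoulli(alpha).
    kl = p0 * math.log(p0 / alpha) + (1 - p0) * math.log((1 - p0) / (1 - alpha))
    return numerator / kl

bounds = {p0: sprt_lower_bound(p0, alpha=0.05, eps=1e-3)
          for p0 in (0.01, 0.04, 0.06, 0.10)}
```

The bound grows rapidly as $p_0$ approaches $\alpha$ because the denominator vanishes there.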

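Suppose, in a Bayesian sense, that p is random, having distribution function F with derivative $F'(\alpha) >
0$. Then for some $c > 0$ and some $\delta > 0$,

$$E(\tau) = \int_0^1 E_p(\tau)\,dF(p) \geq c \int_\alpha^{\alpha+\delta} \left(p\log\left(\frac{p}{\alpha}\right) + (1-p)\log\left(\frac{1-p}{1-\alpha}\right)\right)^{-1} dp = \infty.$$

The last equality holds since the integrand is proportional to $2\alpha(1-\alpha)(p-\alpha)^{-2}$ as $p \to \alpha$ (by e.g.
l'Hospital's rule).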
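
The behaviour of the integrand near $\alpha$ can also be seen from a second-order Taylor expansion. The
following is a worked check added for illustration, writing $D(p) = p\log(p/\alpha) + (1-p)\log((1-p)/(1-\alpha))$
for the denominator:

$$D(\alpha) = 0, \qquad D'(p) = \log\frac{p}{\alpha} - \log\frac{1-p}{1-\alpha}, \quad D'(\alpha) = 0, \qquad D''(p) = \frac{1}{p(1-p)},$$

$$D(p) = \frac{(p-\alpha)^2}{2\alpha(1-\alpha)} + O\left((p-\alpha)^3\right), \qquad \frac{1}{D(p)} \sim \frac{2\alpha(1-\alpha)}{(p-\alpha)^2} \quad (p \to \alpha).$$

Since $\int_\alpha^{\alpha+\delta} (p-\alpha)^{-2}\,dp = \infty$ for every $\delta > 0$, the integral diverges.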

Figure 3: Top: Expected number of steps $E_p(\tau)$ of the algorithm against the true parameter p. Bottom:
$E_p(\tau)$ divided by the theoretical lower limit in (5). The threshold $\alpha = 0.05$ and the spending sequence
$\epsilon_n = \epsilon \frac{n}{1000+n}$ were used.

Suppose we want to use our algorithm to compute the power or the level of a test. Then p is indeed
random and $E(\tau) = \infty$. Thus for this application we have to truncate our stopping time (e.g. by some
deterministic constant).

3.2 Error Bound and Spending Sequence 

Figure 3 illustrates the dependence of the expected number of steps $E_p(\tau)$ on the true p and on the error
bound $\epsilon$. For most p, the algorithm stops quite quickly. Furthermore, the dependence on the bound $\epsilon$ of
the resampling risk is only slight, so $\epsilon$ can be chosen small.

The lower part of Figure 3 shows that our algorithm with the default spending sequence is not too
far away from the theoretical lower bound. Can this be improved by choosing a different $\epsilon_n$? What should
"improved" mean? There is no obvious optimality criterion. As Section 3.1 shows, a criterion like the
average number of steps under the null hypothesis cannot be used since it is always infinite. An option
is to try to minimize something like $\int_0^1 E_p(\tau)/E_p(\sigma_p)\,dp$, i.e. integrating the function plotted in the lower
part of Figure 3. However, pursuing this further is beyond the scope of this article.

In a small study, not reported here, we have looked at other choices besides our default $\epsilon_n = \epsilon \frac{n}{n+k}$.
The main conclusion is that the choice of $\epsilon_n$ does not seem to have a big influence, as long as the allowed
error is spent at a sub-exponential rate (satisfying the conditions in Theorems 1 and 2).

3.3 Bounds on the Estimator Before Stopping



In practice, the algorithm should report back after a fixed number of steps, even if it has not stopped yet. 
After this intermediate stop one can continue the algorithm. 

In the case of such an intermediate stop one can compute an interval in which $\hat{p}$ will eventually lie. One
can base this interval on the inequalities (9) and (10) in the appendix. Indeed, after n steps, conditional
on $\tau \geq n$, one gets $\hat{p} \in [\alpha - (\Delta_n + 1)/n,\ \alpha + (\Delta_n + 1)/n]$, where $\Delta_n = \sqrt{-n\log(\epsilon_n - \epsilon_{n-1})/2}$. Note that
under the assumptions of Theorem 1 we have $(\Delta_n + 1)/n \to 0$.
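
As an illustration, this interval can be evaluated for the default spending sequence. The sketch below
assumes $\epsilon_n = \epsilon n/(n+k)$, for which $\epsilon_n - \epsilon_{n-1} = \epsilon k/((n+k)(n+k-1))$; the function name
`interim_interval` is hypothetical, and the interval is clipped to [0, 1].

```python
import math

def interim_interval(n, alpha, eps, k):
    """Interval [alpha - (D_n + 1)/n, alpha + (D_n + 1)/n], clipped to [0, 1].

    Assumes eps_n = eps * n / (n + k), so that
    eps_n - eps_{n-1} = eps * k / ((n + k) * (n + k - 1)).
    """
    increment = eps * k / ((n + k) * (n + k - 1))
    delta_n = math.sqrt(-n * math.log(increment) / 2.0)
    half_width = (delta_n + 1.0) / n
    return max(0.0, alpha - half_width), min(1.0, alpha + half_width)

lo_1k, hi_1k = interim_interval(1000, alpha=0.05, eps=1e-3, k=1000)
lo_100k, hi_100k = interim_interval(100000, alpha=0.05, eps=1e-3, k=1000)
```

After 1,000 steps the upper end is still close to 0.14, which shows how loose this interval is for
moderate n.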

These bounds are not very tight. They can be improved as follows. Conditional on not having stopped
after n steps, we have

$$\min_{\nu > n} \frac{L_\nu}{\nu} \leq \hat{p} \leq \max_{\nu > n} \frac{U_\nu}{\nu} \qquad (6)$$

(in the proof of Theorem 2 we show $U_\nu > \nu\alpha > L_\nu$). Figure 2 shows a plot of $U_n/n$ and $L_n/n$ for one
particular spending sequence. It seems that $U_\nu/\nu$ is overall decreasing, and that $L_\nu/\nu$ is overall increasing.
This also seems to be true for further spending sequences of the type $\epsilon_n = \epsilon \frac{n}{n+k}$. Thus an ad-hoc way
of computing the upper bound in (6) is by evaluating $\max_{n < \nu \leq n+\mu} U_\nu/\nu$, where $\mu$ is chosen suitably, e.g.
$\mu = 1/\alpha$. A similar argument applies, of course, to the lower bound of (6).

Instead of reporting back after a fixed number of steps, the algorithm could also report back after a 
certain computation time, e.g. one minute. This has the advantage that a sensible default value can be 
used irrespective of the time one sampling step takes. 



3.4 Confidence Intervals 



Confidence intervals for p can be constructed similarly to Armitage (1958). Suppose the algorithm stops
and returns $\hat{p}_{obs}$ as result. Then a $1 - \beta$ confidence interval for p is given by $[\underline{p}, \overline{p}]$, where
$\underline{p} = 0$ for $\hat{p}_{obs} = 0$, $\overline{p} = 1$ for $\hat{p}_{obs} = 1$, and otherwise

$$P_{\underline{p}}(\hat{p} \geq \hat{p}_{obs}) = \beta/2, \qquad P_{\overline{p}}(\hat{p} \leq \hat{p}_{obs}) = \beta/2.$$

Armitage (1958) showed that the probabilities on the left hand sides are strictly monotonic in $\underline{p}$ and $\overline{p}$,
and thus $\underline{p}$ and $\overline{p}$ are well-defined.

One can compute $\underline{p}$ and $\overline{p}$ numerically. If the computation of the above probabilities involves an infinite
sum we consider the complement event instead, e.g. we may replace $P_p(\hat{p} \geq \hat{p}_{obs})$ by $1 - P_p(\hat{p} < \hat{p}_{obs})$.

Suppose the algorithm has not stopped, i.e. $\tau > n$. Then by the arguments of the previous subsection
we get $\hat{p}_{min}$, $\hat{p}_{max}$ such that $\hat{p} \in [\hat{p}_{min}, \hat{p}_{max}]$. Replacing $\hat{p}_{obs}$ in the definition of $\underline{p}$ (resp. $\overline{p}$) by $\hat{p}_{min}$ (resp.
$\hat{p}_{max}$) produces an interval that includes the confidence interval one gets once the algorithm has finished.
Thus this is a confidence interval itself, with a (slightly) increased coverage probability.



3.5 Implementation Details 

To compute $U_n$ and $L_n$ via (3), one needs to know the distribution of $S_n$ given $\tau \geq n$ as well as
$P_\alpha(\tau < n, S_\tau \geq U_\tau)$ and $P_\alpha(\tau < n, S_\tau \leq L_\tau)$. These quantities can be updated recursively. Furthermore, the
amount of memory required to store these quantities is proportional to $U_n - L_n$.

What is the additional computational effort for the sequential procedure? The main effort at each
step is to compute the distribution of $S_n$ given $\tau \geq n$ from the distribution of $S_{n-1}$ given $\tau \geq n - 1$. This
effort is proportional to $U_n - L_n$. Hence, if the sequential procedure stops after n steps the computational
effort is roughly proportional to $\sum_{i=1}^{n} |U_i - L_i|$. In order to get an idea of how big $U_n - L_n$ is we considered
a specific example in Figure 4. In this example it seems as if $U_n - L_n \sim \sqrt{n \log n}$. The overhead of the
algorithm can be removed through precomputation of $L_n$ and $U_n$.

Figure 4: $U_n - L_n$ seems to be roughly proportional to $\sqrt{n \log n}$. We used $\alpha = 0.05$ and $\epsilon_n = \frac{n}{1000(1000+n)}$.
Note the log-scale on the horizontal axis.

Our sequential procedure can be easily parallelized, e.g. by distributing the generation of the samples.



3.6 Using the Algorithm as a Building Block 

Our procedure can be used as a building block in more complicated computations. For example, one can
use the sequential procedures in this paper to estimate the power of a resampling based test, by using
the algorithm in the "inner" loop. Because of the problems mentioned in Section 3.1, the number of
replications in the inner loop has to be restricted by a constant. Of course, this is rather ad-hoc, but
it should give a similar performance (with less computational effort) as the naive approach of nesting
two loops within one another.

For the problem of computing the power of a bootstrap test, some dedicated algorithms exist, such
as that suggested by Boos and Zhang (2000). Their algorithm can be combined with ours by using our
sequential procedure in the inner loop.

To compute the power of a bootstrap test, Jennison (1992) has suggested a sequential procedure for
the "inner" loop. Jennison (1992) uses an approximation to bound the probability of deciding differently
than the bootstrap that uses only a fixed number of samples. In contrast to that, the present article
bounds the probability of deciding differently than the "ideal" bootstrap based on an infinite sample size.

Furthermore, the algorithm can be used iteratively, e.g. for double bootstrap tests. Examples can be
found in Section 4.



4 Applications 

This section demonstrates the wide applicability of our algorithm in a simple example, already used
by Mehta and Patel (1983), by Newton and Geyer (1994), and by Davison and Hinkley (1997, Example
4.22). Suppose 39 observations have been categorized according to two categorical variables resulting in
counts given by the following two-way sparse contingency table:

1 2 2 1 1 0 1
2 0 0 2 3 0 0
0 1 1 1 2 7 3
1 1 2 0 0 0 1
0 1 1 1 1 0 0

Let $A = (a_{ij})$ denote this matrix.

Consider the test of the null hypothesis that the two variables are independent which rejects for large
values of the likelihood ratio test statistic

$$T(A) = 2 \sum_{i,j} a_{ij} \log(a_{ij}/h_{ij}),$$

where $h_{ij} = \sum_{\nu} a_{\nu j} \sum_{\mu} a_{i\mu} / \sum_{\mu,\nu} a_{\mu\nu}$. It is well known that under the null hypothesis, as the sample size
increases, the distribution of T(A) converges to a $\chi^2$-distribution with (7 - 1)(5 - 1) = 24 degrees of
freedom. Applying this test to the above matrix leads to a p-value of 0.031.
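
The test statistic is straightforward to compute. In the following sketch the function name
`lr_statistic` is hypothetical, and the matrix is entered with its empty cells written out as zeros, an
assumption that is checked against the stated total of 39 observations. Cells with $a_{ij} = 0$ contribute
nothing to the sum.

```python
import math

# Contingency table of this section (5 rows, 7 columns, 39 observations).
A = [
    [1, 2, 2, 1, 1, 0, 1],
    [2, 0, 0, 2, 3, 0, 0],
    [0, 1, 1, 1, 2, 7, 3],
    [1, 1, 2, 0, 0, 0, 1],
    [0, 1, 1, 1, 1, 0, 0],
]

def lr_statistic(table):
    """T(A) = 2 * sum a_ij * log(a_ij / h_ij), with 0 * log(0) taken as 0."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, a in enumerate(row):
            if a > 0:
                h = row_sums[i] * col_sums[j] / total  # expected count h_ij
                stat += a * math.log(a / h)
    return 2.0 * stat

T_obs = lr_statistic(A)  # approximately 38.52
```

Comparing `T_obs` with the $\chi^2$-distribution with 24 degrees of freedom gives the asymptotic p-value
quoted above.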



4.1 Parametric Bootstrap 

Since the contingency table A is sparse, the asymptotic approximation may be poor. To remedy this,
Davison and Hinkley (1997) suggested a parametric bootstrap that simulates under the null hypothesis
based on the row and column sums of A.

Using the naive estimator $\hat{p}_{naive}$ with n = 1,000 replicates results in a p-value of 0.041. This is
below the usual threshold of 5%, thus the test would be interpreted as significant. However, as further
computations show, the probability of reporting a p-value larger than 5%

Next, we applied our algorithm using $\alpha = 0.05$, $\epsilon = 10^{-3}$ and $\epsilon_n = \epsilon \frac{n}{1000+n}$. We shall use this $\epsilon$ and
$\epsilon_n$ in all other examples of Section 4. Assume that we decide to let our algorithm run for at most 1,000
steps initially. Not having reached a decision, the algorithm tells us that the final estimate will be in the
interval $[\hat{p}_{min}, \hat{p}_{max}] = [0.027, 0.080]$. Our algorithm finally stops after 8,574 samples, reporting a p-value
of 0.040. The advantage of our algorithm is that we can be (almost) certain that the ideal bootstrap
would also return a significant result.



4.2 Some Notation 

To describe further uses of our algorithm we introduce the following notation. Let $h_\alpha$ be the function
that applies our algorithm with the threshold $\alpha$ to a sequence with elements in {0, 1} and returns the
resulting estimate $\hat{p}$. If the sequence is finite, say of length n, and the algorithm has not stopped after n
steps then $h_\alpha$ simply returns the current estimate $S_n/n$.

With this, the above use of our algorithm for the parametric bootstrap can be written as

$$h_{0.05}\left((1\{T(A_i) \geq T(A)\})_{i \in \mathbb{N}}\right),$$

where $A_1, A_2, \ldots$ denote independent samples under the null hypothesis estimated from the row and
column sums of the matrix A.

4.3 Checking the Level of Tests

As mentioned earlier, the asymptotic $\chi^2$-distribution may not be a good approximation because the
observed matrix A is relatively sparse. We can use our algorithm to check whether the test is conservative
or liberal. To do this at the 5% level we estimate the rejection rate by

$$h_\alpha\left((1\{T(A_i) \geq \chi^2_{24,0.95}\})_{i \in \mathbb{N}}\right),$$

where $\chi^2_{24,0.95}$ denotes the 0.95 quantile of the $\chi^2$-distribution with 24 degrees of freedom. We start our
procedure with a threshold of $\alpha = 0.05$. It stops after 1,437 steps and reports a p-value of 0.074. Hence,
the test based on the asymptotic distribution seems to be liberal.

How liberal is it? To find out whether the rejection rate is above 0.07 we start our procedure with
a threshold of $\alpha = 0.07$. After 66,736 samples, the estimated rejection rate is 0.075. Thus it is (almost)
certain that the test at the nominal level 5%

Next, we check whether the parametric bootstrap test of Section 4.1 does any better. For this we use
the sequential procedure iteratively and compute the rejection rate by

$$h_\alpha\left(\left(1\{h_{0.05}((1\{T(A_{ij}) \geq T(A_i)\})_{j \in \mathbb{N}}) \leq 0.05\}\right)_{i \in \mathbb{N}}\right),$$

where for each $i \in \mathbb{N}$, $A_{i1}, A_{i2}, \ldots$ denote independent samples under the null hypothesis estimated from
the matrix $A_i$.

As explained at the end of Section 3.1, in the "inner" use of $h_{0.05}$ we need to stop after a finite number
M of steps. For the following we use M = 250. Setting $\alpha = 0.05$, the outer algorithm stops after 264
steps yielding a p-value of 0.114. Using $\alpha = 0.07$, after 1,769 steps we get a p-value of 0.096. Hence, the
bootstrap test seems to be quite liberal as well.

For $\alpha = 0.05$ (resp. $\alpha = 0.07$) we generated a total of 21,250 (resp. 131,552) samples in the inner loop.
A naive alternative consists of just two nested loops. To get a similar precision one could use M steps
in the inner loop and 1,000 steps in the outer loop. For this, 250,000 samples need to be generated, far
more than in our nested sequential algorithm.



4.4 Double Bootstrap 

Davison and Hinkley (1997, Example 4.22) suggest that the parametric bootstrap could be improved by
using a double bootstrap. The double bootstrap employs two loops that are nested within one another.
Davison and Hinkley (1997) suggest that a sensible choice would be to use roughly 1,000 steps in the
outer loop and 250 steps in the inner loop. As the classical double bootstrap needs to resample once
before starting the inner loop, it needs 251,000 resampling steps.

To reduce the number of steps, we can use our algorithm iteratively: First, compute the p-value from
the parametric bootstrap using, say, 10,000 samples by

$$\tilde{p} = \frac{1}{10{,}000} \sum_{i=1}^{10{,}000} 1\{T(A_i) \geq T(A)\}.$$

After that, we compute the p-value of the double bootstrap by

Applying this with M = 250, the outer algorithm stops after 1,117 samples in the outer loop and
returns a p-value of 0.078, which, in contrast to the previous tests, is not significant at the 5% level. In the inner
loop we used only 77,405 samples. Adding the 10,000 samples needed to compute $\tilde{p}$ and the 1,117 samples
from the null model fitted to A, the total number of samples generated is 88,522. This compares favorably
to the 251,000 for the classical double bootstrap.

To check the level of the double bootstrap test we can combine the approach of Section 4.3 with the
approach of the current subsection. This results in iterating the procedure three times. For the double
bootstrap we set M = 250 and stop the outer algorithm after 500 steps. If we check whether the true level
of our algorithm at the asymptotic level 5% and reports a p-value of 0.050. Hence the double bootstrap
seems to be less liberal (if it is liberal at all) than the asymptotic test or the simple parametric bootstrap.
In the innermost application of our algorithm we needed 12,688,117 resampling steps. The naive approach
with a similar maximal number of steps for the inner loops and 1,000 steps for the outer loop would have
used $1{,}000 \cdot 500 \cdot 250 = 1.25 \cdot 10^8$ steps in the innermost loop, more than 9 times the number of samples
our iterated algorithm needed.

4.5 Determining Sample Size 

In Section 4.3 we have seen how to use our algorithm to check the level of a test. Similarly, with the
obvious modification of generating $A_i$ from the given alternative, one can check whether a test achieves a
desired power for a given sample size. Furthermore, the minimal sample size that achieves a certain
power can be found by combining our algorithm with e.g. a bisection algorithm.
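
The bisection step can be sketched as follows. This is a generic illustration only: `minimal_sample_size`
and `achieves_power` are hypothetical names, the predicate is assumed to be monotone in the sample size,
and in practice it would wrap a (truncated) run of the sequential algorithm whose answer is itself
subject to a small error probability.

```python
def minimal_sample_size(achieves_power, lo, hi):
    """Smallest m in [lo, hi] with achieves_power(m) true, assuming monotonicity.

    achieves_power is a hypothetical callable deciding whether the desired
    power is reached for sample size m.
    """
    if not achieves_power(hi):
        raise ValueError("target power not reached at the upper limit")
    while lo < hi:
        mid = (lo + hi) // 2
        if achieves_power(mid):
            hi = mid        # mid is large enough, search below it
        else:
            lo = mid + 1    # mid is too small, search above it
    return lo

# Toy stand-in: pretend the power target is met from sample size 137 onwards.
m_min = minimal_sample_size(lambda m: m >= 137, lo=1, hi=10_000)
```

Each evaluation of the predicate costs one run of the algorithm, so the number of runs grows only
logarithmically in the width of the search range.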

5 Conclusions 

We presented a sequential procedure to compute p-values by sampling. When the algorithm stops one has
the "peace of mind" that, up to a small error probability, the p-value reported by the procedure is on the
same side of some threshold as the theoretical p-value. In other words, the resampling risk is uniformly
bounded by a small constant. If the algorithm has not stopped then one can give an interval in which the
final estimate will be.

The basic algorithm can also be used in several other situations. It can be used to check whether a
test is conservative or liberal, it can be used iteratively for double bootstrap tests, and it can be used to
determine the sample size needed to achieve a certain power.

A Proofs

The following lemma is needed in the proof of Theorem 1.

Lemma 3. For $0 \leq p \leq q \leq 1$,

$$P_p(\tau < \infty, S_\tau \geq U_\tau) \leq P_q(\tau < \infty, S_\tau \geq U_\tau),$$
$$P_p(\tau < \infty, S_\tau \leq L_\tau) \geq P_q(\tau < \infty, S_\tau \leq L_\tau).$$

Proof. Let $V_1, V_2, \ldots$ be independent random variables with a uniform distribution on [0,1] under the
probability measure $P^*$. For $x \in [0, 1]$, let

$$S_{x,n} = \sum_{i=1}^{n} 1\{V_i \leq x\}.$$

Clearly $S_{p,n} \leq S_{q,n}$. Let $\tau_x = \inf\{k \in \mathbb{N} : S_{x,k} \geq U_k \text{ or } S_{x,k} \leq L_k\}$. Then

$$P_p(\tau \leq n, S_\tau \geq U_\tau) = P^*(\tau_p \leq n, S_{p,\tau_p} \geq U_{\tau_p}) \overset{(\dagger)}{\leq} P^*(\tau_q \leq n, S_{q,\tau_q} \geq U_{\tau_q}) = P_q(\tau \leq n, S_\tau \geq U_\tau). \qquad (7)$$

To see $(\dagger)$ one can argue as follows: Suppose $\tau_p \leq n$ and $S_{p,\tau_p} \geq U_{\tau_p}$. Then $S_{q,\tau_p} \geq S_{p,\tau_p} \geq U_{\tau_p}$. Hence,
$\tau_q \leq \tau_p$. Furthermore, for all $k < \tau_p$ we have $S_{q,k} \geq S_{p,k} > L_k$. Hence, $S_{q,\tau_q} \geq U_{\tau_q}$. Letting $n \to \infty$ in (7)
finishes the proof of the first inequality. The second inequality can be shown similarly. □

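Proof of Theorem 1. Suppose $L_n \geq U_n$ for some n. Then, by the definition of $U_n$ and $L_n$,

$$1 = P_\alpha(\tau \leq n) \leq P_\alpha(S_\tau \geq U_\tau, \tau \leq n) + P_\alpha(S_\tau \leq L_\tau, \tau \leq n) \leq 2\epsilon_n < 2\epsilon < 1,$$

which is a contradiction. Hence, $U_n > L_n$.

Let $j_n = \lceil \Delta_n + n\alpha \rceil$, where $\Delta_n = \sqrt{-n\log(\epsilon_n - \epsilon_{n-1})/2}$ and $\lceil x \rceil$ denotes the smallest integer greater
than or equal to x. We show $U_n \leq j_n$. By a special case of Hoeffding's inequality, see Okamoto (1958, Theorem 1)
or Hoeffding (1963, Theorem 1),

$$P_\alpha(S_n \geq j_n, \tau \geq n) \leq P_\alpha(S_n \geq j_n) = P_\alpha(S_n/n - \alpha \geq j_n/n - \alpha) \leq \exp\left(-2n(j_n/n - \alpha)^2\right) \leq \exp\left(-2n\left(\frac{\Delta_n}{n}\right)^2\right) = \epsilon_n - \epsilon_{n-1}. \qquad (8)$$

By the definition of $U_{n-1}$,

$$P_\alpha(S_\tau \geq U_\tau, \tau < n) = P_\alpha(S_{n-1} \geq U_{n-1}, \tau = n-1) + P_\alpha(S_\tau \geq U_\tau, \tau < n-1) \leq \epsilon_{n-1}.$$

This together with the definition of $U_n$ and (8) yields $U_n \leq j_n$. Thus,

$$\frac{U_n - n\alpha}{n} \leq \frac{\Delta_n + 1}{n} \to 0 \quad (n \to \infty). \qquad (9)$$

Similarly, one can show

$$\frac{L_n - n\alpha}{n} \geq -\frac{\Delta_n + 1}{n} \to 0 \quad (n \to \infty). \qquad (10)$$

Together with $U_n > L_n$ we get $U_n - n\alpha = o(n)$ and $n\alpha - L_n = o(n)$.

Next, we show $E_p(\tau) < \infty$ for $p > \alpha$. For large n (say $n \geq n_0$),

$$p - \frac{U_n}{n} = (p - \alpha) - \left(\frac{U_n}{n} - \alpha\right) = (p - \alpha) + o(1) \geq (p - \alpha)/2.$$

For $n \geq n_0$, since $U_n/n - p < 0$, Hoeffding's inequality shows

$$P_p(\tau > n) \leq P_p(S_n < U_n) = P_p\left(\frac{S_n}{n} - p < \frac{U_n}{n} - p\right) \leq \exp\left(-2n\left(\frac{U_n}{n} - p\right)^2\right).$$

Thus $P_p(\tau > n) \leq \exp(-n(p - \alpha)^2/2)$ for $n \geq n_0$. Hence,

$$E_p(\tau) = \int_0^\infty P_p(\tau > n)\,dn \leq n_0 + \int_{n_0}^\infty \exp(-n(p - \alpha)^2/2)\,dn = n_0 + \frac{2}{(p - \alpha)^2} \exp(-n_0(p - \alpha)^2/2) < \infty.$$

Similarly, one can show $E_p(\tau) < \infty$ for $p < \alpha$.

To see (4): Let $p \leq \alpha$. By the previous lemma we have $P_p(\tau < \infty, S_\tau \geq U_\tau) \leq P_\alpha(\tau < \infty, S_\tau \geq U_\tau)$.
Since $\epsilon_n < \epsilon$, equation (1) implies $P_\alpha(\tau < \infty, S_\tau \geq U_\tau) \leq \epsilon$. Thus $P_p(\tau < \infty, S_\tau \geq U_\tau) \leq \epsilon$. The second
part of (4) can be shown similarly. □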

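Proof of Theorem 2. By (4) it suffices to show that $\tau < \infty$, $S_\tau \geq U_\tau$ implies $\hat{p} > \alpha$ and that $\tau < \infty$,
$S_\tau \leq L_\tau$ implies $\hat{p} \leq \alpha$.

First, we show $L_n < n\alpha$. Suppose $L_n \geq n\alpha$. Let $c = \lceil n\alpha \rceil$. Hence, by (Uhlmann, 1966, Satz 6), if
$c \leq \frac{n-1}{2}$,

$$\frac{1}{2} < P_{\frac{c}{n-1}}(S_n \leq c) \leq P_\alpha(S_n \leq c)$$

since $\frac{c}{n-1} \geq \frac{n\alpha}{n-1} = \alpha + \frac{\alpha}{n-1} \geq \alpha$; and if $c \geq \frac{n-1}{2}$,

$$\frac{1}{2} < P_{\frac{c+1}{n+1}}(S_n \leq c) \leq P_\alpha(S_n \leq c)$$

since $\frac{c+1}{n+1} \geq \frac{n\alpha+1}{n+1} = \alpha + \frac{1-\alpha}{n+1} \geq \alpha$. Hence,

$$\frac{1}{2} < P_\alpha(S_n \leq c) \leq P_\alpha(S_n \leq L_n) \leq P_\alpha(\tau \leq n) \leq 2\epsilon_n < \frac{1}{2},$$

which is a contradiction. $U_n \leq n\alpha$ leads to a contradiction in a similar way. □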



References 

Andrews, D. W. K. and Buchinsky, M. (2000). A three-step method for choosing the number of bootstrap 
repetitions. Econometrica, 68(1):23-51. 

Andrews, D. W. K. and Buchinsky, M. (2001). Evaluation of a three-step method for choosing the number
of bootstrap repetitions. Journal of Econometrics, 103(1-2):345-386.

Armitage, P. (1958). Numerical studies in the sequential estimation of a binomial parameter. Biometrika,
45(1/2):1-15.

Besag, J. and Clifford, P. (1991). Sequential Monte Carlo p-values. Biometrika, 78(2):301-304. 

Boos, D. D. and Zhang, J. (2000). Monte Carlo evaluation of resampling-based hypothesis tests. Journal
of the American Statistical Association, 95(450):486-492.

Davidson, R. and MacKinnon, J. G. (2000). Bootstrap tests: How many bootstraps? Econometric
Reviews, 19(1):55-68.

Davison, A. and Hinkley, D. (1997). Bootstrap methods and their application. Cambridge University 
Press. 

Fay, M. P. and Follmann, D. A. (2002). Designing Monte Carlo implementations of permutation or
bootstrap hypothesis tests. American Statistician, 56(1):63-70.

Fay, M. P., Kim, H.-J., and Hachey, M. (2007). On using truncated sequential probability ratio test
boundaries for Monte Carlo implementation of hypothesis tests. Journal of Computational & Graphical
Statistics, 16:946-967.

Gleser, L. J. (1996). Comment on Bootstrap Confidence Intervals by T. J. DiCiccio and B. Efron. Statistical 
Science, 11:219-221. 

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the
American Statistical Association, 58(301):13-30.

Hope, A. C. A. (1968). A simplified Monte Carlo significance test procedure. Journal of the Royal 
Statistical Society. Series B (Methodological), 30(3):582-598. 

Jennison, C. (1992). Bootstrap tests and confidence intervals for a hazard ratio when the number of 
observed failures is small, with applications to group sequential survival studies. In Page, C. and 
LePage, R., editors, Computing Science and Statistics, Proc. 22nd Symposium Interface, pages 89-97, 
New York. Springer. 

Jockel, K. H. (1984). Computational aspects of Monte Carlo tests. In Proceedings of COMPSTAT 84, 
pages 185-188, Vienna. International Association for Statistical Computing, Physica-Verlag. 

Lai, T. L. (1988). Nearly optimal sequential tests of composite hypotheses. The Annals of Statistics, 
16:856-886. 

Lan, K. K. G. and DeMets, D. L. (1983). Discrete sequential boundaries for clinical trials. Biometrika, 
70:659-663. 

Mehta, C. R. and Patel, N. R. (1983). A network algorithm for performing Fisher's exact test in r x c
contingency tables. Journal of the American Statistical Association, 78(382):427-434.

Newton, M. A. and Geyer, C. J. (1994). Bootstrap recycling: A Monte Carlo alternative to the nested
bootstrap. Journal of the American Statistical Association, 89(427):905-912.

Okamoto, M. (1958). Some inequalities relating to the partial sum of binomial probabilities. Annals of
the Institute of Statistical Mathematics, 10:29-35.

Uhlmann, W. (1966). Vergleich der hypergeometrischen mit der binomial-verteilung. Metrika, 10(1):145-158.

Wald, A. (1945). Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics,
16(2):117-186.

Wald, A. and Wolfowitz, J. (1948). Optimum character of the sequential probability ratio test. The
Annals of Mathematical Statistics, 19(3):326-339.